Abstract

Failure detectors (or, more accurately, Failure Suspectors, FS) appear
to be a fundamental service upon which to build fault-tolerant,
distributed applications. This paper shows that an FS with very weak
semantics (i.e. one that delivers failure and recovery information in
no specific order) suffices to implement virtually-synchronous
communication (VSC) in an asynchronous system subject to process crash
failures and network partitions. The VSC paradigm is particularly
useful in asynchronous systems and greatly simplifies building
fault-tolerant applications that mask failures by replicating
processes. We suggest a three-component architecture to implement
virtually-synchronous communication: 1) at the lowest level, the FS
component; on top of it, 2a) a component that defines new views, and
2b) a component that reliably multicasts messages within a view. The
issues covered in this paper also lead to a better understanding of
the various membership service semantics proposed in recent
literature.


*The first author is on leave from Ecole Polytechnique Fédérale de Lausanne, Switzerland.
His research is supported by the "Fonds national suisse" under contract number
21-32210.91, as part of the European ESPRIT Basic Research Project Number 6360
(BROADCAST). The second author is supported by DARPA/NASA Ames Grant NAG 2-593,
and by grants from IBM and Siemens Corporation.
1 Introduction


There have recently been several papers about membership services in asynchronous sys- 
tems [2, 12, 13, 17, 18, 19, 20]. A membership service is responsible for giving each process 
(consistent) information about the operational processes in the system. A process calls this 
information its view of the system processes. A membership service typically reacts to process 
crashes or recoveries, leading it to define a set of views. The membership services mentioned 
vary according to the underlying failure model considered, as well as the properties they 
provide with respect to the set of views delivered to each process (e.g. whether another
view may exist simultaneously, or the degree of agreement among members):

• [17, 18] consider processes with crash failure semantics, excluding network partitions. 

• [19, 20] consider systems in which processes may crash and the network may partition. 
However, despite network partitions, this membership service defines only majority 
views - a unique, totally-ordered sequence of views. Such a membership service is said 
to have linear semantics. 

• The membership services described in [1, 2, 13] consider the same failure scenario as 
above, but only define a partial order on the views. That is, if the system is partitioned 
in two (or more) subnetworks then two (or more) views, one in each subnetwork, may 
exist concurrently. 

Concurrent views offer an interesting extension to membership services, and force us to 
consider a further semantic distinction based on whether concurrent views are permitted to 
intersect. If two concurrent views may overlap, we say the membership service semantics
are weak-partial; if they may not, we say the semantics are strong-partial. Among those
that permit concurrent views, [2] appears to be a strong-partial membership service, [13]
considers both strong-partial and weak-partial membership services, and [1] and [12] consider
only weak-partial membership services. These variants raise a new, pertinent question: when
is a strong-partial service required, and when does a weak-partial membership service suffice?
The objective of this paper is to suggest an answer to this question, by showing that a
strong-partial membership service is intimately related to virtually-synchronous communication.
We do not discuss when a linear membership service is required.

The idea of virtually-synchronous communication (VSC) was first introduced by Isis [3, 4].
VSC can be understood as a rule for ordering message deliveries (reliable multicasts) with
respect to view changes (received from the membership service). We give a precise definition
of VSC in Section 5.4. VSC defines a powerful model for building fault-tolerant processes
that mask failures by replication. It has also been argued [5] that ordering message deliveries
consistently around process failures and recoveries is a fundamental part of any distributed
computation; thus VSC is a vital primitive for inherently-distributed programming. Relatedly,
many common distributed applications are more easily understood and solved if they
can make use of VSC [21]. Finally, if the VSC abstraction we define in this paper is augmented
with a majority requirement, [22] shows it is a powerful model in which transaction
commit is easily (albeit probabilistically) implemented. Understanding that the VSC abstraction
is more basic than the transaction abstraction gives broader insight into the problem
of building fault-tolerant applications. However, we note that solving VSC is not equivalent
to solving consensus [10].

Traditionally, virtually-synchronous communication has been implemented with a two-component
architecture: a membership service and, on top of it, a multicast component. However,
understanding the relationship between a membership service and virtually-synchronous
communication has led us to consider a three-component architecture, with (1) a Failure
Suspector component FS delivering information about the communication topology, (2)
a View Component VC defining views, and (3) a Multicast Component MC implementing
virtually-synchronous communication. We divide the functionality of the traditional membership
service between our FS and VC components.

In addition to increasing our understanding of the relationship between any membership 
service and virtually-synchronous communication, this architecture allowed us to specify 
precisely the FS semantics needed to guarantee VC and MC liveness. One weakness of 
previous work in this area has been a lack of precise semantics for the FS part of the system. 

Explicitly, the paper shows: 

• that virtually-synchronous communication satisfying the definition given in Section 5.4 
can be implemented with a modular, three-component architecture for system models 
with both process crash failures and network partitions (i.e. link failures). We start 
with a very simple model, and from it construct a useful communication primitive for 
fault-tolerant, distributed applications. 

• how to define concurrent views that have empty intersections. That is, how to imple- 
ment strong-partial membership semantics in a system that may partition. The basic 
idea is to define a view as a set of pairs (proc id, proc sequence number). 
• that if we remove the MC component from the architecture (e.g. if virtually-synchronous
communication is not needed), then the view component defines views that do not 
satisfy the empty intersection condition (i.e. giving a membership service with a weak- 
partial semantics). 

Section 2 describes our low-level system model and the interaction of the three components.
Section 3 gives a precise semantics for the failure suspector. Sections 4 and 5 sketch how to
implement the VC_p and MC_p components, and Section 6 completes the VC_p and MC_p protocols.
We conclude in Section 7.


2 System Model 

Our low-level system model consists of an infinite name space of process identifiers,
Proc = {p_1, p_2, ...}. The name space is infinite to model infinite executions in which processes
continually fail and recover. At any point in time, however, there are only a finite number of
executing processes under consideration and we restrict our attention to these. For this finite
set of executing processes, we assume a completely-connected network of FIFO channels.
Processes communicate by passing messages over these channels, though the channels too may fail.
The system has no global clock, and message transmission delays are unbounded. Processes
fail by crashing, which we model by the local event crash_p. We model the recovery of a
process with a new identifier. A process p may (1) send a message to another process, (2)
deliver a message sent by another process q, and (3) perform local computation.

A history, h_p, for process p is a sequence of events beginning with the event start_p and
terminating, if at all, with the event crash_p: $h_p = \mathit{start}_p \cdot e_p^1 \cdots e_p^k$, for $0 \le k$. A cut is an
n-tuple of process histories, one for each p ∈ Proc. We assume familiarity with inter-event
causality [15] and with consistent cuts [8].

Crash failures are surprisingly difficult to handle in an asynchronous system. Fischer
et al. [10] show that, because it is impossible to distinguish a crashed process from one
that is just very slow, any problem requiring "all correct processes" to agree on some value
cannot be solved deterministically; that is, no deterministic protocol can make progress if it
must also make accurate process failure detections. One way around this is for asynchronous
systems to incorporate some mechanism for suspecting failures, as well as a means of handling
failure suspicions consistently (e.g. p may suspect q faulty while r may not; perhaps
r and/or q even suspect p). Our system model assumes a failure suspector that eventually
suspects a crashed process,¹ which suffices to ensure our protocols make progress. We do
not require anything more of the failure suspector.

Figure 1: FS_p, MC_p, and VC_p interaction for virtually-synchronous communication.

Each process has three components that interact to implement the virtually-synchronous
communication primitive for application-layer processes (Figure 1). The Failure Suspector
(FS_p) is at the lowest level and notifies both the Multicast Component (MC_p) and the View
Component (VC_p) about suspected changes in the communication topology. Such changes
arise from actual process and link failures, as well as from high processor loads and heavy
network traffic (which are indistinguishable from true failures). VC_p defines p's current view,
View_p(), an approximation of the set of processes with which p can communicate, and sends
View_p() to MC_p. MC_p is responsible for reliably multicasting application-layer messages until
it receives an accessibility-change notification from FS_p. These notifications signal a suspected
change in the communication topology and the attendant need to alter View_p(). However, neither
MC_p nor VC_p can do this naively, since virtually-synchronous communication requires that
members of View_p() that also accompany p to its next view receive the same set of messages
that were multicast within View_p() (we make this definition precise in Section 5). To ensure
this, MC_p delivers all outstanding multicasts, and does not issue new multicasts except to
forward those that have been only partially delivered. View_p() is safely terminated when all
messages multicast in it are delivered at all sites that MC_p believes non-faulty. When MC_p
detects this condition (Section 4) it informs VC_p, which then determines a new view for MC_p
from the accessibility notifications it received from FS_p.

Section 3 describes the properties our Failure Suspector components must satisfy. These are
weak yet reasonable requirements, and are easily implemented in any asynchronous system.
Section 4 discusses VC_p, and Section 5 discusses MC_p. These components execute protocols
to detect global properties [8, 16].

¹ This can easily be implemented with time-outs.


3 The Failure Suspector 

Given process p, FS_p emits a sequence of not-comm(q) and comm(r) suspicion messages to MC_p
and VC_p. Since the system is asynchronous we cannot guarantee the accuracy or timeliness of
these suspicions; the most we can require is that FS_p eventually suspects true crashes and
recoveries. This is not unreasonable. It is known that fault-tolerant protocols in asynchronous
systems cannot make progress if they are required to make accurate failure determinations.

Our approach introduces an inaccurate failure suspector to gain liveness. On the other hand,
we cannot require FS_p to suspect all periods of transient inaccessibility: a network partition
may repair before it is noticed.

Since, in theory, FS_p may suspect processes arbitrarily, we have divorced FS_p implementation
from the problem at hand. In a real system, FS_p might take cues from the underlying
communication layer, the operating system, response delays, and so forth.²

On every consistent cut c, FS_p maintains two non-intersecting sets, CommSet_p(c) and NotCommSet_p(c).
When FS_p suspects q ∈ CommSet_p(c), q is removed from CommSet_p(c) and is thereafter a
member of NotCommSet_p(c). Whenever these sets change, FS_p notifies VC_p and MC_p by
emitting the appropriate comm() or not-comm() messages.
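
The bookkeeping just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FailureSuspector`, the methods `suspect` and `trust`, and the use of callbacks to stand in for VC_p and MC_p are hypothetical, and the sketch assumes that p always keeps itself in its own CommSet. How suspicions are actually raised (time-outs, operating system cues, and so on) is deliberately left out.

```python
from typing import Callable, List, Set, Tuple

Notification = Tuple[str, str]  # ("comm" or "not-comm", process id)


class FailureSuspector:
    """Hypothetical FS_p keeping CommSet_p and NotCommSet_p disjoint."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        # Assumption: p always considers itself reachable.
        self.comm_set: Set[str] = {owner}
        self.not_comm_set: Set[str] = set()
        # Callbacks standing in for VC_p and MC_p.
        self.listeners: List[Callable[[Notification], None]] = []

    def _emit(self, kind: str, q: str) -> None:
        for listener in self.listeners:
            listener((kind, q))

    def suspect(self, q: str) -> None:
        """Move q into NotCommSet_p; emit not-comm(q) only if the sets changed."""
        if q == self.owner or q in self.not_comm_set:
            return
        self.comm_set.discard(q)
        self.not_comm_set.add(q)
        self._emit("not-comm", q)

    def trust(self, q: str) -> None:
        """Move q into CommSet_p; emit comm(q) only if the sets changed."""
        if q in self.comm_set:
            return
        self.not_comm_set.discard(q)
        self.comm_set.add(q)
        self._emit("comm", q)


fs = FailureSuspector("p")
events: List[Notification] = []
fs.listeners.append(events.append)
fs.trust("q")
fs.suspect("q")
fs.suspect("q")  # the sets do not change, so nothing is emitted
```

Emitting a notification only when a set actually changes mirrors the sentence above; a real suspector would also have to satisfy the liveness conditions given later in this section.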

We have a reciprocity condition for (perceived) partitions, as well. To model the nature of
network partitions, we require eventual reciprocity of inaccessibility suspicions. That is, if
FS_p suspects q then eventually either FS_q suspects p or q fails.

A logical formula holds on a consistent cut. The membership of an indexical set of processes 
depends on when it is considered. In our model, ‘when’ translates to consistent cuts, the only 
physically-realizable instances. We use the following formulas and indexical sets to specify 
the behavior of FS P . 

• NOTCOMM_p(q) holds on c if q ∈ NotCommSet_p(c).

• COMM_p(q) holds on c if q ∈ CommSet_p(c).

• DOWN_q holds on c = (h_1, ..., h_q, ..., h_n) if crash_q is the last event in h_q.

• UP_q holds on c = (h_1, ..., h_q, ..., h_n) if crash_q is not an event in h_q.

² For example, to detect failures FS_p could query a process, deeming it inaccessible if it does not respond
in a timely fashion (inaccurate, but satisfying the requirement). We might put the onus on a process to
announce its recovery.


Non-triviality Conditions for FS_p

Crashes If q crashes, then eventually either p crashes or FS_p suspects q is unreachable:

$$\mathrm{DOWN}_q \Rightarrow \Diamond\bigl(\mathrm{NOTCOMM}_p(q) \lor \mathrm{DOWN}_p\bigr)$$

Recoveries If q begins executing and is reachable, then eventually either p crashes or FS_p
suspects q is reachable:

$$\mathrm{UP}_q \Rightarrow \Diamond\bigl(\mathrm{COMM}_p(q) \lor \mathrm{DOWN}_p\bigr)$$

Reciprocity If FS_p suspects q is inaccessible, then, if q does not crash, it eventually suspects
p is inaccessible:

$$\mathrm{NOTCOMM}_p(q) \Rightarrow \Diamond\bigl(\mathrm{DOWN}_q \lor \mathrm{NOTCOMM}_q(p)\bigr)$$


This is an artifact of p suspecting q: since p ceases communicating with q , p is, in fact, 
inaccessible to q. 


Propagation Conditions for FS_p

Finally, we require failure suspectors to gossip among themselves.

Inaccessibility Propagation If FS_p believes, on cut c, it cannot communicate with q then
it tries to propagate this belief to every FS_r for r ∈ CommSet_p(c):

$$\mathrm{NOTCOMM}_p(q) \Rightarrow \Diamond\bigl(\mathrm{NOTCOMM}_r(q) \lor \mathrm{NOTCOMM}_r(p)\bigr)$$

Accessibility Propagation If FS_p believes, on cut c, it can communicate with q then it
tries to propagate this belief to every FS_r for r ∈ CommSet_p(c):

$$\mathrm{COMM}_p(q) \Rightarrow \Diamond\bigl(\mathrm{COMM}_r(q) \lor \mathrm{NOTCOMM}_r(p)\bigr)$$
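
All of the conditions in this section have the shape "if A holds on a cut, then eventually B holds". As a rough aid to intuition, the following Python sketch checks one of them, Reciprocity, against a finite recorded sequence of cuts. Everything here is an assumption made for illustration: the `Cut` record, the helper `leads_to`, and the idea of recording NotCommSet snapshots are not part of the paper. A finite trace can only show that a condition has not yet been met within the observed prefix; it cannot establish an "eventually" property of an infinite execution.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Cut:
    """Hypothetical snapshot of one consistent cut."""
    not_comm: Dict[str, Set[str]]                # p -> NotCommSet_p on this cut
    down: Set[str] = field(default_factory=set)  # processes whose last event is a crash


def leads_to(trace: List[Cut],
             premise: Callable[[Cut], bool],
             goal: Callable[[Cut], bool]) -> bool:
    """True if every cut satisfying `premise` is followed (at that cut or
    later in the trace) by a cut satisfying `goal`."""
    for i, cut in enumerate(trace):
        if premise(cut) and not any(goal(later) for later in trace[i:]):
            return False
    return True


def reciprocity_holds(trace: List[Cut], p: str, q: str) -> bool:
    """NOTCOMM_p(q) leads to (DOWN_q or NOTCOMM_q(p)), within the trace."""
    return leads_to(
        trace,
        lambda c: q in c.not_comm.get(p, set()),
        lambda c: q in c.down or p in c.not_comm.get(q, set()),
    )


trace = [
    Cut({"p": {"q"}, "q": set()}),   # p suspects q, q does not yet suspect p
    Cut({"p": {"q"}, "q": {"p"}}),   # q now suspects p as well
]
```

The other four conditions can be checked the same way by swapping in the corresponding premise and goal predicates.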
3.1 Related work


Before discussing the other components, we discuss the relation between this and other work. 
In [7], Chandra and Toueg solve Distributed Consensus in an asynchronous system using a 
Failure Suspector, W , that satisfies certain (weak) requirements. [6] further shows that W 
is the weakest suspector that can be used to solve Distributed Consensus. While we do not
consider consensus in this paper, we said in the Introduction that adding a majority
requirement to the VSC abstraction gives a simple, probabilistic solution to transaction commit.
Since there are no fundamental differences between solving the consensus and atomic commit
problems, how are the two approaches related? (We will not, hereafter, distinguish consensus
from atomic commit.)

First it should be clear that our Failure Suspector is not weaker than W . More important, 
[7] also places a majority requirement on processes before W can be used to solve consensus. 
To relate the two approaches, consider a generalization of consensus: 


• suppose consensus is to be solved more than once, and let consensus(i), for i > 0, be
the i-th instance of the consensus problem;

• let Proc be the initial set of processes that solve consensus(1);

• consensus(i + 1) begins only after consensus(i) has been solved;

• for consensus(i), i > 1, the processes choose their initial state randomly from the set
{0, 1}.


In [7], consensus(i) (for each i) would be solved by the same static set of processes Proc. The
majority requirement to solve consensus(i) is thus similar to a static voting scheme in the
context of handling replicated data [11]. This is because [7] considers that failure suspicions
are never stable: a process p believing failed(q) can always change its mind.

In contrast, in the VSC model, failure beliefs are stable each time a new view is defined.
Thus for i ≠ j, consensus(i) and consensus(j) need not be solved by the same set of
processes. Continuing the replicated data analogy, the majority requirement in the VSC
model is similar to the dynamic voting scheme [9], which has been shown to lead to higher
data availability than the static voting scheme.
4 The View Component


The view component operates whenever a link failure repairs, a process begins executing
or recovers after a crash, and whenever the multicast component informs it that the current
view has terminated (Section 5). VC_p defines p's current view by interaction with other VC
components, and by using FS_p information.

VC_p defines a new view when it detects (or learns about through some other VC component)
agreement on CommSet_p() among the members of CommSet_p(). The new view will be the
largest subset of processes (containing p) satisfying this agreement.

4.1 The View Component Algorithm

In this section we outline how VC_p detects or learns about CommSet_p() agreement.

When VC_p is activated, it knows a near approximation³ of CommSet_p() from FS_p. Whenever
VC_p receives a comm(r) message from FS_p, it updates this approximation. Along cut c, VC_p
uses a deterministic function, vc-Coord(p),⁴ on the set CommSet_p(c), which returns a unique
process identifier and satisfies

$$\bigl(\mathit{CommSet}_p(c) = \mathit{CommSet}_q(c)\bigr) \Rightarrow \bigl(\textit{vc-Coord}(p) = \textit{vc-Coord}(q)\bigr).$$

For example, vc-Coord(p) might be "choose the 'smallest' identifier from CommSet_p(c)".
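
The "smallest identifier" rule mentioned above can be written directly. The sketch below assumes process identifiers are strings ordered lexicographically, which is an illustrative choice only; any total order on identifiers would do. Because the function depends on nothing but the contents of the set, two processes holding equal CommSets necessarily choose the same coordinator.

```python
from typing import AbstractSet


def vc_coord(comm_set: AbstractSet[str]) -> str:
    """Return the smallest identifier in CommSet_p(c) (one possible rule)."""
    if not comm_set:
        raise ValueError("CommSet must contain at least one identifier")
    return min(comm_set)


# p and q hold equal sets (built in different orders), so they agree.
comm_set_p = {"p3", "p1", "p2"}
comm_set_q = {"p2", "p3", "p1"}
agree = vc_coord(comm_set_p) == vc_coord(comm_set_q)
```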

Each process also maintains a local counter, seq_p, which is incremented every time VC_p
considers vc-Coord(p) to have changed. This is not necessarily every time CommSet_p(c) changes;
for liveness, however, vc-Coord(p) must change when VC_p receives not-comm(vc-Coord(p))
from FS_p. The counter seq_p is initially zero and is essential in allowing us to define
non-intersecting, concurrent views. The tuple (p, seq_p) fully describes p on any consistent cut.

Finally, the formula CommSetEq(S) holds on c if and only if all p ∈ S have identical
CommSet() sets at c. That is,

$$\mathit{CommSetEq}(S) \stackrel{\mathrm{def}}{=} \bigwedge_{p,q \in S} \bigl(\mathit{CommSet}_p() = \mathit{CommSet}_q()\bigr)$$

In our protocol, each p sends its current CommSet_p() and current seq_p number to vc-Coord(p)
every time CommSet_p() changes.

³ There may be notifications from FS_p that have not yet reached VC_p.

⁴ Technically, we should name some cut explicitly since the function's value depends on p's indexical
can-communicate-with set. We omit the cut reference, but with the understanding that vc-Coord(p) has a
temporal dependence. In fact p never knows which particular cut it is on, but at any point in its execution
VC_p has some set of process identifiers that satisfy a certain condition. It determines a coordinator by
applying some rule to this set. The presence of c would only clarify matters for the omniscient reasoner.


4.2 Defining the New View 

Let κ = vc-Coord(p), and S = CommSet_κ(c) for some cut c. Then VC_κ receives CommSet_p()
for p ∈ S. Whenever it receives a different CommSet_p() from some p, VC_κ discards the
previous one and checks whether CommSetEq(CommSet_κ()) holds. If it does, VC_κ sets the
new view, View_κ(), to

$$\mathit{View}_\kappa() = V = \{(p, \mathit{seq}_p) \mid p \in \mathit{CommSet}_\kappa()\} \qquad (1)$$

The coordinator κ then sends the new view to each VC_p (for p ∈ V), which then delivers the
view to MC_p. MC_p regains execution control and begins multicasting again. Unfortunately,
as CommSetEq(CommSet_κ()) is not a stable property (a stable property being one that, once
true, is forever true), we must take care in announcing the new view. We return to this issue
in Section 6.
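
The coordinator's side of this step can be sketched as follows. This is a simplified illustration, not the protocol itself: the class `ViewCoordinator` and its method names are hypothetical, messages are modelled as direct method calls, and the sketch ignores the instability of the agreement condition that the text defers to Section 6. It keeps only the latest report from each process and, once every member of the coordinator's own CommSet has reported the same set, builds the view as a set of (identifier, sequence number) pairs as in Equation (1).

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

Signature = Tuple[str, int]  # (process id, seq number)


class ViewCoordinator:
    """Hypothetical VC component of the coordinator kappa."""

    def __init__(self, kappa: str) -> None:
        self.kappa = kappa
        # Latest (CommSet_p(), seq_p) received from each p; older reports are discarded.
        self.reports: Dict[str, Tuple[FrozenSet[str], int]] = {}

    def report(self, p: str, comm_set: Iterable[str],
               seq: int) -> Optional[FrozenSet[Signature]]:
        """Record p's report and return a new view if agreement now holds."""
        self.reports[p] = (frozenset(comm_set), seq)
        return self.try_define_view()

    def try_define_view(self) -> Optional[FrozenSet[Signature]]:
        own = self.reports.get(self.kappa)
        if own is None:
            return None
        target = own[0]
        # CommSetEq(CommSet_kappa()): every member must have reported the same set.
        for p in target:
            entry = self.reports.get(p)
            if entry is None or entry[0] != target:
                return None
        return frozenset((p, self.reports[p][1]) for p in target)


coordinator = ViewCoordinator("a")
first = coordinator.report("a", {"a", "b"}, 0)        # b has not reported yet
second = coordinator.report("b", {"a", "b", "c"}, 2)  # b disagrees with a
third = coordinator.report("b", {"a", "b"}, 3)        # agreement reached
```

In this run only the third report yields a view; the earlier report from b is simply overwritten, matching the "discards the previous one" rule above.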


4.3 The Partial Order 

Correctness of VC_p means that the coordinator successfully sends the new view to the VC
components of all reachable members in the new view. We will henceforth use V to denote
the (local) view that is agreed-upon by all the members of V.

Since process histories are linear, it makes sense to talk about the x-th version of a process's
(local) view; we denote this by View_p^x.

Definition Given two agreement views V and V', V ≺₁ V' if and only if there is some p in
V ∩ V' such that V = View_p^x and V' = View_p^{x+1}. The transitive closure of ≺₁ is denoted ≺.

It is not hard to see that the views defined by the collection of VC_p components are partially
ordered by ≺. We say V and V' are concurrent if and only if they are not ≺-related.

Proposition 4.1 trivially follows from the definition of views (Equation 1) and the increment
rule for seq_p.

Proposition 4.1 Let V and V' be concurrent views. Then V ∩ V' = ∅.
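
A small illustration of why the sequence numbers matter, using an assumed scenario that is not taken from the paper: two views both name the identifier q, but with different values of seq_q. Compared as sets of identifiers they overlap; compared as sets of (identifier, sequence number) pairs, which is how views are defined, they are disjoint.

```python
from typing import FrozenSet, Tuple

Signature = Tuple[str, int]  # (process id, seq number)


def ids(view: FrozenSet[Signature]) -> FrozenSet[str]:
    """Process identifiers appearing in a view."""
    return frozenset(p for p, _ in view)


# Hypothetical views: q appears in both, with different sequence numbers.
view_v = frozenset({("p", 2), ("q", 1)})
view_v_prime = frozenset({("q", 2), ("r", 1)})

shared_ids = ids(view_v) & ids(view_v_prime)
shared_signatures = view_v & view_v_prime
```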
5 The Multicast Component


The Multicast Component of process p, MC_p, is responsible for implementing
virtually-synchronous communication. MC_p operates in two modes. In one mode it multicasts messages
to the members of its current view View_p(). In the other mode, it flushes outstanding
multicasts to ensure they satisfy virtually-synchronous communication semantics, then terminates
the current view. The transition from multicast mode to termination mode is triggered by
any FS_p not-comm() or comm() message. In this section, we define VSC semantics and the
protocols MC_p uses.


5.1 Definitions 

Informally, virtually-synchronous communication is such that, for any view V, the processes
of view V that mutually believe each other alive deliver the same set of multicasts.⁵ To make
the definition of VSC precise we need to define formally the set of messages considered to
have been multicast in V, as well as the subset of processes that deliver them.

Definition Given a view V, message m is a V-multicast if it was sent by some p along a
cut c such that View_p(c) = V.

Definition (VSC) Let V ≺₁ V' (that is, V' is the immediate successor of V for some process).
Then communication in a system is virtually-synchronous if and only if all processes in V
and in V' delivered the same set of V-multicasts. Moreover, no message is delivered in more
than one view.

It is important to notice that process sequence numbers are not used in the definition.
These are low-level pieces of information; the application layer should only be concerned
with process identifiers. For application-layer processes, VSC ensures that if two processes
progress together from one view to another, then they delivered the same set of
messages in the first view. As a result, if process state is determined by an initial state and
the set of multicasts delivered to the process, VSC means that if processes begin executing
in view V in the same state, then switch together to view V', they will begin executing in
V' in the same state.
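
The first clause of the VSC definition can be expressed as a check over recorded deliveries. The sketch below is an assumption-laden illustration: the function `satisfies_vsc`, the `delivered` map (process identifier to the identifiers of the V-multicasts it delivered while in V), and the message names are all hypothetical. It compares processes by identifier only, in line with the remark above that sequence numbers do not appear in the definition, and it does not check the second clause (that no message is delivered in more than one view).

```python
from typing import AbstractSet, Dict, FrozenSet, Tuple

Signature = Tuple[str, int]  # (process id, seq number)


def ids(view: AbstractSet[Signature]) -> FrozenSet[str]:
    """Process identifiers appearing in a view."""
    return frozenset(p for p, _ in view)


def satisfies_vsc(view: AbstractSet[Signature],
                  next_view: AbstractSet[Signature],
                  delivered: Dict[str, AbstractSet[str]]) -> bool:
    """True if every process in both views delivered the same V-multicasts."""
    together = ids(view) & ids(next_view)
    sets = [frozenset(delivered.get(p, ())) for p in together]
    return all(s == sets[0] for s in sets)


v = {("p", 0), ("q", 0), ("r", 0)}
v_next = {("p", 1), ("q", 1)}  # r does not accompany p and q

ok = satisfies_vsc(v, v_next, {"p": {"m1", "m2"}, "q": {"m2", "m1"}, "r": {"m1"}})
bad = satisfies_vsc(v, v_next, {"p": {"m1"}, "q": {"m1", "m2"}, "r": set()})
```

Note that r's deliveries are irrelevant in both calls: r does not move to the next view, so the definition places no constraint on it.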

⁵ For simplicity, we omit other forms of communication. Non-multicast communications do not introduce
new problems.
5.2 Two Modes of Operation

The component MC_p operates in two modes:

1. in normal mode MC_p reliably multicasts messages issued by the application layer of p,
and delivers to the application layer multicasts it receives from other MCs;

2. in view-termination mode MC_p does not multicast new messages; instead it attempts
to flush outstanding multicasts to ensure the VSC semantics.

After receiving a view from VC_p, MC_p is in normal mode. It enters view-termination mode as
soon as it receives any (in)accessibility notification from FS_p. When view-termination mode
ends, MC_p gives control back to VC_p. MC_p is inactive until it receives a new view from VC_p,
whereupon MC_p begins normal mode again.

5.3 MC_p Normal Mode

Suppose VC_p defines a view V = View_p() and delivers this to MC_p. Recall that views are sets
of tuples, which we call process signatures:

$$\mathit{View}_p() = \{\sigma_q = (q, \mathit{seq}_q)\}.$$

Upon receiving View_p(), MC_p enters normal mode, in which it multicasts and delivers
messages. Each message m issued by the application layer of process p is multicast by MC_p to
all q ∈ V. Before issuing the message, MC_p adds σ_p to m. Let sender(m) be the signature
of the process from which m originated.

When MC_p receives a message m, the following sequence of events occurs:

1. MC_p delivers m (to the application layer) if sender(m) ∈ V, and discards m otherwise;

2. MC_p also buffers any message it receives and delivers in V until it knows all other
processes in V have received m.⁶ When m is received by all processes in V we say it
is stable.

By delivering only V-multicasts, the normal mode ensures that no multicast can be delivered
in more than one view (see the VSC definition).

⁶ There are many standard ways of achieving this, e.g. piggybacking information on messages.
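
The two normal-mode rules can be sketched as a small Python class. This is a simplified illustration rather than the protocol: the names `Message`, `NormalMode`, `receive`, and `acknowledge` are hypothetical, the `msg_id` field is an assumed way of recognising duplicates, and stability is learned through explicit acknowledgement calls, which stand in for whatever mechanism (such as piggybacking) a real system would use.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set, Tuple

Signature = Tuple[str, int]  # (process id, seq number)


@dataclass(frozen=True)
class Message:
    sender: Signature  # signature added by the originating MC
    msg_id: int        # assumed per-sender identifier, used to spot duplicates
    payload: str


@dataclass
class NormalMode:
    """Hypothetical MC_p in normal mode for one view."""
    me: Signature
    view: FrozenSet[Signature]
    delivered: List[Message] = field(default_factory=list)
    seen: Set[Message] = field(default_factory=set)
    # Unstable messages, each with the signatures known to have received it.
    buffer: Dict[Message, Set[Signature]] = field(default_factory=dict)

    def receive(self, m: Message) -> bool:
        """Rule 1: deliver m only if sender(m) is in V; otherwise discard."""
        if m.sender not in self.view or m in self.seen:
            return False
        self.seen.add(m)
        self.delivered.append(m)
        self.buffer[m] = {self.me, m.sender}
        return True

    def acknowledge(self, m: Message, by: Signature) -> None:
        """Rule 2: keep m buffered until every member of V has received it."""
        holders = self.buffer.get(m)
        if holders is None:
            return
        holders.add(by)
        if holders >= self.view:
            del self.buffer[m]  # m is now stable


view = frozenset({("p", 0), ("q", 0), ("r", 0)})
mc = NormalMode(me=("p", 0), view=view)
current = Message(("q", 0), 1, "hello")
stale = Message(("q", 7), 1, "from another view")  # signature not in V
```

Because the test in `receive` is on the full signature, a message carrying a signature from a different view is discarded even when its process identifier is familiar.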
5.4 MC_p View-Termination Mode

Consider a view V = View_p(). Component MC_p switches from normal mode to
view-termination mode after receiving from FS_p either 1) not-comm(q) for q ∈ View_p(), or 2)
comm(r) for r ∉ View_p(). This is because whenever a change in the communication topology
is detected a new view must be defined reflecting that change. However, before defining a
new view, MC in view-termination mode must ensure the VSC definition is satisfied.

Once MC_p enters view-termination mode, it need only consider relevant not-comm() events
from FS_p to terminate V. Thus, while executing in view-termination mode, MC_p builds its
own approximation of NotCommSet_p(). This means failure notifications have a permanent
effect until view-termination mode ends: comm(q) received by MC_p in view-termination mode
after not-comm(q) (for example due to a partition) cannot undo the not-comm(q) information.

Just as a new view for p is defined according to agreement on CommSet()s, successfully
terminating V involves partitioning V according to NotCommSet() agreement.

Definition The indexical set Survives_p(V) is V minus the set of processes MC_p believes failed
in V:

$$\mathit{Survives}_p(V) = V - \{(q, \mathit{seq}_q) \mid \mathrm{NOTCOMM}_p(q)\}$$


Before we can explain how to ensure VSC, we need the following data structures.

Definition Consider V = View_p() and consistent cut c. The vector msg_p(V, c) (of size |V|)
is defined such that:

• its p-th component, msg_p(V, c)[p], is the number of V-multicasts that originated from p
(up to c);

• for q ∈ V, q ≠ p, its q-th component, msg_p(V, c)[q], is the number of V-multicasts MC_p
delivered up to c that originated from q.

Definition (View Terminated) Consider view V and S such that ∅ ≠ S ⊆ Ids(V) (where
Ids(V) is the set of process identifiers appearing in V). Then VT(V, S) holds along cut c if
and only if

$$\bigwedge_{p,q \in S} \Bigl(\bigl(\mathit{msg}_p(V,c) = \mathit{msg}_q(V,c)\bigr) \land \bigl(\mathit{Survives}_p(V,c) = \mathit{Survives}_q(V,c)\bigr)\Bigr)$$

It is not hard to see S = Ids(Survives_p(V)).
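
The Survives set and the View Terminated predicate translate almost directly into code. The sketch below is a minimal illustration under stated assumptions: message vectors are modelled as dictionaries from originating identifier to a count, the function names are hypothetical, and all the per-process data is handed to the predicate at once, as an omniscient observer of a single cut would see it.

```python
from typing import AbstractSet, Dict, FrozenSet, Tuple

Signature = Tuple[str, int]  # (process id, seq number)
View = FrozenSet[Signature]


def ids(view: AbstractSet[Signature]) -> FrozenSet[str]:
    """Process identifiers appearing in a view."""
    return frozenset(p for p, _ in view)


def survives(view: View, not_comm: AbstractSet[str]) -> View:
    """Survives_p(V): V minus the signatures of processes p believes failed."""
    return frozenset((q, seq) for q, seq in view if q not in not_comm)


def view_terminated(s: AbstractSet[str],
                    msg: Dict[str, Dict[str, int]],
                    surv: Dict[str, View]) -> bool:
    """VT(V, S): all members of S agree on msg vectors and Survives sets."""
    members = list(s)
    if not members:
        raise ValueError("S must be non-empty")
    first = members[0]
    return all(msg[p] == msg[first] and surv[p] == surv[first] for p in members)


view = frozenset({("p", 0), ("q", 0), ("r", 0)})
# p and q both suspect r; r suspects nobody.
surv = {
    "p": survives(view, {"r"}),
    "q": survives(view, {"r"}),
    "r": survives(view, set()),
}
msg = {
    "p": {"p": 2, "q": 1, "r": 0},
    "q": {"p": 2, "q": 1, "r": 0},
    "r": {"p": 1, "q": 1, "r": 0},
}
```

With this data the predicate holds for S = {p, q} but not for S = {p, q, r}, and Ids(Survives_p(V)) is exactly {p, q}, in line with the closing remark of the definition.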
In other words, VT(V, S) is true exactly when the processes in S agree on both the messages
multicast in V and on their respective Survives(V) sets. For MC_p, detecting termination of
V = View_p() is thus reduced to detecting VT(V, S) (for p ∈ S ⊆ Ids(V)).

Having detected VT(V, S), whether S = Ids(V) or S ⊂ Ids(V) is important in determining
the new view. In the first case, whatever view, V', VC_p later defines, VSC is satisfied with
respect to the pair (V, V'). In the second case MC_p must pass Survives_p(V) to VC_p; we will
want the new view to be a subset of Survives_p(V).

To guarantee that every non-crashed process in V eventually detects VT(V, S) for some S,
MC_p behaves as follows in view-termination mode:

• it stops multicasting new messages;⁷

• it rejects any message m such that sender(m) ∉ Survives_p(V);

• upon receiving not-comm(q) from FS_p (for q ∈ V), MC_p signs and forwards any
V-multicasts originating from q that are still in p's buffer (Section 5.3). MC_p then removes
these messages from its buffer. The MC component of a recipient r rejects the re-issued
message if NOTCOMM_r(p) holds (i.e. if MC_r has received not-comm(p) from FS_r).⁸

Proposition 5.1 Consider view-termination mode as described above. Then for each p ∈ V,
there exists a set S_p such that p ∈ S_p and VT(V, S_p) holds.

PROOF (sketch) We introduce the following notation:

• $\mathrm{VT}_1(V,S) \stackrel{\mathrm{def}}{=} \bigwedge_{p,q \in S} \bigl(\mathit{msg}_p(V) = \mathit{msg}_q(V)\bigr)$

• $\mathrm{VT}_2(V,S) \stackrel{\mathrm{def}}{=} \bigwedge_{p,q \in S} \bigl(\mathit{Survives}_p(V) = \mathit{Survives}_q(V)\bigr)$

Consider p ∈ V. We build a sequence S_p^0, ..., S_p^i, ..., S_p^n, where for all i, p ∈ S_p^i and
S_p^i ⊆ Ids(V), such that finally VT(V, S_p^n) holds. Initially take S_p^0 = Ids(V). The proof ends
as soon as VT(V, S_p^i) holds, for some i. If not, then VT_1(V, S_p^i) or VT_2(V, S_p^i) does not hold.
We obtain S_p^{i+1} from S_p^i by removing a process (if necessary). Because (1) S_p^0 is finite, (2) the
number of messages sent in a view is finite once view-termination mode is started (processes do
not issue new multicasts in this mode), and (3) VT(V, {p}) is trivially true, the construction finally
ends with S_p^n such that VT(V, S_p^n) holds. We briefly discuss the proof reasoning for the case
when either VT_1(V, S_p^i) or VT_2(V, S_p^i) does not hold.

(a) If VT_1(V, S_p^i) does not hold, then the message set of some q in S_p^i differs from p's in some
component r ∈ S_p^i: msg_p(V)[r] ≠ msg_q(V)[r]. If eventually these sets become equal,
then take S_p^{i+1} = S_p^i. If not (i.e., msg_p(V)[r] never equals msg_q(V)[r]), then either DOWN_r,
or NOTCOMM_r(p), or NOTCOMM_r(q) holds. So suppose NOTCOMM_r(p) holds (analogous
arguments hold for NOTCOMM_r(q) and DOWN_r). Then eventually NOTCOMM_p(r) holds
(from FS_p Reciprocity). The Reissuing rule in view-termination mode means that p will
forward to q all messages it received from r that q did not. However, since the message
sets never agree, this transfer will not succeed completely before NOTCOMM_q(p) eventually
holds. Reciprocity ensures that NOTCOMM_p(q) holds, and at this point we define S_p^{i+1} to be
S_p^i − {q}.

(b) If VT_2(V, S_p^i) does not hold, then there is some q in S_p^i such that Survives_p(V) ≠
Survives_q(V). Without loss of generality let r ∈ Survives_q(V) − Survives_p(V). Then
Inaccessibility Propagation and Reciprocity mean that eventually either NOTCOMM_q(r), or
NOTCOMM_p(q) holds. In the first case take S_p^{i+1} to be S_p^i − {q}; in the second case, take
S_p^{i+1} = S_p^i.

⁷ If the network were a broadcast domain, MC_p could continue multicasting using a new signature
(p, seq_p + 1). The problem for less general environments is that the new multicast view (destination set)
is not yet known.

⁸ Duplicate messages are recognized and discarded as usual.

5.5 An Algorithm to Detect VT(V, S)

Like the VC_p algorithm detecting CommSetEq(), the MC_p algorithm detecting VT(V, S_p)
relies on a coordinator process. MC_p determines its view-termination coordinator with a
deterministic function, mc-Coord(p), on the set Survives_p(V). We require that for p and q
in V with identical Survives(V) sets, mc-Coord(p) = mc-Coord(q).

Let x = mc-Coord(p). Then x attempts to detect VT(V, Survives_x(V)). MC_p also increments
the sequence number counter, seq_p, whenever MC_p considers mc-Coord(p) to have changed (for
liveness, the function mc-Coord(p) must change whenever MC_p receives not-comm(mc-Coord(p))
from FS_p).
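
As an illustration only, one deterministic choice for mc-Coord(p) is the smallest identifier in Survives_p(V); the text does not prescribe this choice. The Python sketch below assumes that process identifiers are comparable and that p never removes itself from Survives_p(V). The names mc_coord, McState, and on_not_comm are hypothetical.

```python
def mc_coord(survives):
    """Hypothetical mc-Coord: the smallest identifier in Survives_p(V).

    Any two processes with identical Survives sets obtain the same result,
    which is the only requirement the text places on the function.
    """
    return min(survives)


class McState:
    """Hypothetical bookkeeping for MC_p (names are not from the paper)."""

    def __init__(self, survives):
        self.survives = set(survives)   # Survives_p(V); assumed to contain p
        self.seq = 0                    # seq_p
        self.coord = mc_coord(self.survives)

    def on_not_comm(self, r):
        """Handle a not-comm(r) notification from FS_p."""
        self.survives.discard(r)
        new_coord = mc_coord(self.survives)
        if new_coord != self.coord:
            # The coordinator changed, so seq_p is incremented.
            self.coord = new_coord
            self.seq += 1
```

Because the suspected process is removed from the set before the minimum is recomputed, this choice necessarily changes when not-comm(mc-Coord(p)) arrives, which is the liveness condition stated above.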

Process p sends msg_p(V), Survives_p(V), and seq_p to mc-Coord(p) when MC_p first considers
mc-Coord(p) to be its coordinator, and whenever msg_p(V) and Survives_p(V) are modified.
If x = mc-Coord(p), then:

    VT(V, Survives_x(V))  ≡  ⋀_{p ∈ Survives_x(V)} ( msg_x(V) = msg_p(V)  ∧  Survives_x(V) = Survives_p(V) )
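
The coordinator's check can be sketched in Python as follows. This is a minimal sketch, not the paper's algorithm: it assumes that x keeps the most recent (msg_p(V), Survives_p(V)) pair received from each process, including its own, in a dictionary. The function name detects_vt and the data layout are hypothetical.

```python
def detects_vt(x, reports):
    """Return True if coordinator x can conclude VT(V, Survives_x(V)).

    reports maps a process identifier to the latest pair
    (msg, survives) that x holds for it, where msg maps each sender to
    the set of its messages received in V, and survives is a frozenset.
    This layout is an assumption made for illustration.
    """
    if x not in reports:
        return False
    msg_x, survives_x = reports[x]
    for p in survives_x:
        if p not in reports:
            return False            # nothing received from p yet
        msg_p, survives_p = reports[p]
        if msg_p != msg_x or survives_p != survives_x:
            return False            # message sets or Survives sets differ
    return True
```

A coordinator would re-evaluate this predicate each time a new report arrives, since a later report can make the condition hold or stop holding.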

Proposition 5.2 Consider a view V, with p ∈ V, and the view-termination protocol described
above. Then eventually, either p crashes or it detects VT(V, Survives_x(V)).

PROOF (sketch) The proof is similar to that of Proposition 5.1. Here, we consider the
perspective of x = mc-Coord(p). The problem is that, due to transmission delays, x may
not detect VT(V, Survives_x(V)) as soon as it holds (msg_p(V) and Survives_p(V) must first be
transmitted from p to x).

There are two cases: eventually x receives the messages enabling it to detect VT(V, Survives_x(V)),
or failures prevent x from detecting it. In the second case, if both Comm_p(x) and Comm_x(p)
hold, we can use the iterative construction, from the perspective of x, in the proof of
Proposition 5.1. Otherwise we must consider the iterative construction with respect to x', the
coordinator replacing x once it is no longer a member of Survives_p(V). ∎

Finally, the fact that VT(V, S) is not stable poses the same problems as those posed by
CommSetEq()'s instability. We consider both in the next section.


6 Instability of CommSetEq() and VT(V, S)

As described in the previous sections, once VC_p learns CommSetEq(CommSet_p()) it switches
control to MC_p; switching control from MC_p to VC_p is based on detecting VT(View_p(), S).
In both cases, the relevant property is not stable - it may become false after holding
along some cut. Let switch(vc, V') be the message announcing the new view, V', and
switch(mc, Survives()) be the message announcing termination of view V.

Since neither CommSetEq(S) nor VT(V, S) is a stable property, we can arrive at the following
situation (see footnote 9):

• Take p, q ∈ V such that p and q believe each other accessible, and let k be their mutual
VC coordinator (k = vc-Coord(p) = vc-Coord(q)). Suppose VC_k determines the new
view, V' (k, p, q ∈ V'), sends switch(vc, V') to p only, and then crashes. VC_p, upon
receiving switch(vc, V'), adopts View_p() = V' and switches control to MC_p in normal
mode.

• Now suppose that in addition to VC_q not getting switch(vc, V'), FS_q notifies VC_q that
k is inaccessible; q continues executing in VC, waiting for some new coordinator k' to
inform it of the new view. In particular, suppose k' = p.

• Since p and q continue to believe each other accessible, FS_q gossips not-comm(k) to FS_p.
At this point, MC_p enters view-termination mode for view View_p() = V', and q is still
executing in VC, waiting to receive the successor view to V. Observe that unless one
of the processes crashes or a network partition splits them, p and q need never believe
each other inaccessible.

• For VC_q to make progress, its coordinator VC_p must tell it some new view. Unfortunately,
VC_p cannot begin executing until MC_p leaves view-termination mode. MC_p
cannot leave view-termination mode until it receives Survives_q() from MC_q (after all,
q ∈ V' and q ∈ CommSet_p()). In other words, p and q are deadlocked because their
execution controls are out of phase. The control discrepancy prevents either one (VC_q
or MC_p) from making progress until one of them believes the other inaccessible - q is
stuck in VC_q, and p is stuck in MC_p.

9 While we illustrate instability with CommSetEq() and the switch from VC_p to MC_p, a similar
situation arises for VT(V, S) as well.

While processes being out of phase is not always destructive, and in fact is quite natural
whenever partitions occur, it is destructive in this case since it induces deadlock. The
following protocol precludes deadlock.

6.1 Component-Switch Protocol 

Let k be shorthand for vc-Coord(p) when VC_p is executing. We describe the protocol only
for the switch from VC_p to MC_p; the situation is analogous for the reverse switch. Let
V = View_p(). We define the following concepts as depicted in Figure 2:

• From Section 4, each accessibility notification from FS_p forces VC_p to inform its
coordinator VC_k of the change to CommSet_p(). Let VC-alert_p() denote the message VC_p
sends to VC_k to inform VC_k of the change to CommSet_p().

• Let FS-VC-Notify_p(V') be the set of not-comm(q) and comm(r) accessibility notifications
VC_p received from FS_p after sending its first CommSet_p() to any coordinator and before
receiving switch(vc, V') from VC_k.

So given V' and FS-VC-Notify_p(V'), VC_p can infer which VC-alert_p() messages reached VC_k
before it detected CommSetEq(CommSet_k()) and which did not. Let FS-VC-Late_p be the
subset of FS-VC-Notify_p(V') for which the corresponding VC-alert_p() message did not reach
VC_k.
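
The text does not spell out how VC_p performs this inference. One plausible rule, stated here purely as an assumption, is that a not-comm(q) alert reached the coordinator exactly when q is absent from V', and a comm(r) alert reached it exactly when r is present in V'. The Python sketch below applies that rule; it further assumes that each process appears at most once in FS-VC-Notify_p(V') and that V' is given as a set of process identifiers. The function name fs_vc_late is hypothetical.

```python
def fs_vc_late(notifications, new_view):
    """Return the notifications whose VC-alert did not reach the coordinator.

    notifications: list of (kind, process) pairs in arrival order, where
    kind is "not-comm" or "comm" (the contents of FS-VC-Notify_p(V')).
    new_view: the set of process identifiers announced in switch(vc, V').
    The inference rule is an assumption made for illustration.
    """
    late = []
    for kind, proc in notifications:
        if kind == "not-comm":
            reached = proc not in new_view   # coordinator already dropped proc
        elif kind == "comm":
            reached = proc in new_view       # coordinator already added proc
        else:
            raise ValueError(f"unknown notification kind: {kind!r}")
        if not reached:
            late.append((kind, proc))
    return late
```

With notifications not-comm(r), comm(s), not-comm(q) and V' = {p, q, s}, only not-comm(q) is classified as late, because q is still a member of the announced view.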

[Figure 2 (diagram not reproduced): message timeline between FS_p, VC_p, and VC_k. Labels in the
figure: not-comm(r), comm(s), not-comm(q); CommSet_p(), VC-alert_p(-r), VC-alert_p(+s),
VC-alert_p(-q); CommSetEq(CommSet_k()); switch(vc, V').]

Figure 2: FS-VC-Notify_p(V') (lightly-shaded rectangle), VC-alert_p(), FS-VC-Late_p
(darkly-shaded rectangle)

The Component-Switch Protocol for VC_p is:

1. The coordinator k sends the switch(vc, V') message using a best-effort reliable
multicast [14] (a process receiving the message reissues it to all the destination processes).

2. Upon receiving switch(vc, V'), VC_p:

(a) logically reorders it to be before VC_p sent any of the messages in FS-VC-Late_p (this
will be clearer after 3);

(b) installs V' as View_p() and switches control to MC_p, in normal mode.

3. MC_p handles messages in FS-VC-Late_p as if the corresponding notifications from FS_p had
just arrived (i.e. while MC_p is executing, and not while VC_p was executing). Specifically,
MC_p simulates receiving these accessibility notifications in View_p() = V'.
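
The three steps can be sketched in Python as follows. This is a simplified model under stated assumptions, not the paper's implementation: views are sets of process identifiers, the transport is a callback supplied by the caller, and a notification counts as late when it is not reflected in V' (an inference rule assumed for illustration). The class SwitchingProcess and all of its member names are hypothetical.

```python
class SwitchingProcess:
    """Hypothetical model of one process p running VC_p and MC_p."""

    def __init__(self, pid, send):
        self.pid = pid
        self.send = send              # assumed transport callback: send(dest, message)
        self.component = "VC"         # component that currently has control
        self.view = None              # View_p()
        self.vc_notifications = []    # FS-VC-Notify_p, in arrival order
        self.mc_notifications = []    # notifications handled by MC_p in the current view

    def on_fs_notification(self, kind, proc):
        """Record a not-comm/comm notification under the executing component."""
        if self.component == "VC":
            self.vc_notifications.append((kind, proc))
        else:
            self.mc_notifications.append((kind, proc))

    def on_switch(self, new_view):
        """Handle switch(vc, V') while VC_p is executing."""
        if self.component != "VC":
            return  # duplicate of a switch that was already handled
        new_view = frozenset(new_view)
        # Step 1: reissue the switch message to every other destination.
        for dest in sorted(new_view - {self.pid}):
            self.send(dest, ("switch", "vc", new_view))
        # Step 2(a): notifications not reflected in V' form FS-VC-Late_p
        # (assumed rule: not-comm(q) is late if q is in V',
        #  comm(r) is late if r is not in V').
        late = [
            (kind, proc)
            for kind, proc in self.vc_notifications
            if (kind == "not-comm") == (proc in new_view)
        ]
        # Step 2(b): install V' and hand control to MC_p in normal mode.
        self.view = new_view
        self.component = "MC"
        self.vc_notifications = []
        # Step 3: MC_p treats the late notifications as if they had just arrived.
        self.mc_notifications = list(late)
```

Replaying the late notifications inside MC_p is what realizes the logical reordering of step 2(a): from MC_p's point of view, those notifications arrived after the view V' was installed.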

Proposition 6.1 The Component-Switch Protocol prevents deadlock. 

PROOF (sketch) We restrict this discussion to a process p, in view V, switching from its VC_p
to MC_p component, and suppose q ∈ V. Suppose q never switches from VC_q to MC_q in view
V. We show this does not prevent p from later switching from MC_p back to VC_p.

Because p switches to MC_p in view V, p has received the switch(vc, V) message. By the
Component-Switch Protocol, p has reissued switch(vc, V) to q. Then either:

1. q never receives switch(vc, V), or

2. q receives switch(vc, V) after having already switched to MC_q in view V', with V' ≠ V.

In the first case, NotComm_q(p) holds eventually. In the second, p ∈ V' contradicts p ∈
V. Thus, NotComm_q(p) holds, and FS_p Reciprocity means eventually either p crashes or
NotComm_p(q) holds. Once NotComm_p(q) holds, p's progress (i.e. switching back to VC_p)
is decoupled from q's progress; q cannot be responsible for blocking p. ∎


7 Concluding Remarks 

This paper has shown how to implement virtually-synchronous communication using a
three-component architecture for systems that experience process crash failures and network
partitions. The three-component architecture led us to define a clear semantics for a Failure
Suspector (a necessary part of any live, asynchronous system) that guarantees liveness of the
VC and MC components. Clearly defining these semantics allows one to implement the Failure
Suspector as a modular tool - distinct from all other components - whose implementation
can take advantage of the characteristics of the underlying network.

Considering a membership service in relation to virtually-synchronous communication also
led us to better understand the need for a strong-partial compared to a weak-partial
membership service. Specifically, a strong-partial membership service (non-intersecting concurrent
views) is naturally related to virtually-synchronous communication. We can understand this
in the following way. The MC component must identify the sender of a message by its
signature Cq to ensure that no multicast is delivered in more than one view. This led us to define
a view as a set of process signatures. Considering the increment conditions of seq_p, two
different views V and V' trivially have an empty intersection. In other words, by requiring
that no multicast be delivered in more than one view, we were led to the strong-partial
membership service. However, if we remove the MC component (i.e. if the membership service
is only defined by FS and VC, without any reference to communication), then the sequence
number seq_p has no clear justification. In that case, a view is just a set of process identifiers
(or a set of identifiers and an incarnation number). With this definition, the same VC
protocol we described would define concurrent views that overlap, providing only a weak-partial
membership service.
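
The difference between the two view definitions can be made concrete with a small Python sketch. It assumes that a signature is a pair (process identifier, sequence number), in the style of the signature (p, seq_p + 1) mentioned in footnote 7, and that p's sequence number differs between the two views. The helper names and the sample values are hypothetical.

```python
def view_of_ids(members):
    """A view as a plain set of process identifiers (weak-partial case)."""
    return frozenset(members)


def view_of_signatures(members, seq):
    """A view as a set of (identifier, sequence number) signatures."""
    return frozenset((p, seq[p]) for p in members)


# Process p belongs to both views, but with a different sequence number in each
# (an assumption of this example).
v_ids = view_of_ids({"p", "q"})
v2_ids = view_of_ids({"p", "r"})
v_sigs = view_of_signatures({"p", "q"}, {"p": 3, "q": 1})
v2_sigs = view_of_signatures({"p", "r"}, {"p": 4, "r": 2})

overlap_ids = v_ids & v2_ids       # identifier-based views share p
overlap_sigs = v_sigs & v2_sigs    # signature-based views share nothing
```

Here the identifier-based views intersect in p, whereas the signature-based views are disjoint, which is the non-intersecting property of a strong-partial membership service.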